clusTransition: An R package for monitoring transition in cluster solutions of temporal datasets

The primary purpose of cluster analysis is to divide a dataset into a finite number of segments based on the similarities between items. In recent years, a significant amount of research has focused on the spatio-temporal aspects of clustering. In this setting, clusters are no longer regarded as static objects, since changes in the underlying population influence them. This paper describes an R package implementing the MONIC framework for tracing the evolution of clusters extracted from temporal datasets. The package is named clusTransition, which stands for Cluster Transition. The algorithm is based on re-clustering cumulative datasets that evolve at successive time-points and monitoring the transitions experienced by the clusters in these clustering solutions. The contribution of this paper is to describe how the package clusTransition is developed in the R programming language and to illustrate its workflow using hypothetical and real-life datasets.


Introduction
The prime goal of cluster analysis is to organize a dataset into a finite number of segments according to the similarities between objects. Ideally, the objects in the same segment should be more similar to one another than to the objects belonging to different partitions [1]. Each individual partition is known as a cluster, whereas the objects belonging to the same cluster are called its members [2,3]. Clustering has many real-world applications, ranging from business and economics, marketing, pattern recognition, medical sciences, and image processing to big data analysis [4]. For example, in the field of market segmentation, better marketing strategies can be adopted by clustering customers with similar demographic or buying characteristics [5]. In a similar fashion, clustering might help in better understanding a disease and targeting appropriate treatment by subgrouping patients into homogeneous sets based on psychological inventory scores [6]. Since the notion of a cluster is not precisely defined, several algorithms and models have been proposed in the literature, and they may produce quite different clustering solutions [7,8]. In recent years, a considerable amount of research has investigated the spatio-temporal properties of clustering. In these applications, clusters are no longer considered static objects, as they are affected by changes occurring in the underlying population [9,15]. The inclusion of new data records in the original population over time may affect cluster memberships, and entirely different clustering solutions may be generated at later time-points.
This transition in clustering solutions may include the disappearance of specific clusters, migration of some elements from one cluster to another, splitting of a cluster into several, merging of several clusters into one, survival of a cluster, and emergence of new ones. The survived clusters can experience internal transitions, including changes in location, size, and density [10,11]. Various topics such as spatio-temporal, evolutionary, stream, and incremental clustering address this issue by working with datasets that change over time. Tracing and understanding the phenomena behind this transition is of practical importance for effective decision-making. This can be helpful in fields such as marketing, fraud detection, networking, scientific publication, and health [12].
In many real-world applications, clustering of a data stream is performed continuously to identify changes in the pattern of the underlying phenomena [13]. In a stream, new data items are continually generated and join the underlying population at regular intervals. Therefore, in order to control which part of the data contributes to the mined pattern, the stream needs to be discretized into subsets based on some ordered attribute. This discretization into subsets is called the windowing approach and is mainly done based on time. Some of the most commonly used examples are the landmark, sliding, and damped window models [14]. These models are discussed in the next section. Chakrabarti et al. [15] introduce the notion of evolutionary clustering to process a time-stamped dataset by producing a sequence of clustering solutions, that is, one clustering solution for each time-step of the temporal data. The algorithm optimizes two competing criteria: each clustering in the sequence should be similar to the clustering at the previous time-step, while at the same time it should accurately reflect the data arriving during that time-step. This framework has been further extended to spectral clustering [16], density-based clustering [17], and the Hierarchical Dirichlet Process with a Hidden Markov model [18].
Using a fully online method, Hyde et al. [19] offer an algorithm that clusters evolving data streams into arbitrarily shaped clusters. The approach consists of two stages: the first stage finds micro-clusters in the datasets, and the second stage merges these micro-clusters into macro-clusters. In a similar vein, Fahy et al. [20] describe an Ant Colony Stream Clustering technique built on a density-based methodology that recognises clusters as a collection of micro-clusters. To read a stream and create micro-clusters in the window pane, the method uses a tumbling window model. These clusters are then further refined by combining related clusters based on a similarity index. Fahy and Yang [21] further enhance this technique to address the multi-density issue in the density-based clustering strategy. This method uses the local radius of each cluster to identify clusters, and it then tracks changes in the solutions. Multiple-view clustering challenges are addressed for the first time by Huang et al. [22] in the MVStream clustering method. In order to assign cluster labels to the data items that include summary statistics, this technique creates support vectors from various views of the data objects. Similarly, some studies have measured the similarities between trajectories in a dynamic environment [23-25].

Window models
In a landmark window model, all items that arrive after some specific time-point (the landmark time) are maintained and cannot be discarded, irrespective of window size. The window size is unbounded and keeps increasing as time progresses [26,27]. The data records that arrive in the interval $(t_{i-1}, t_i)$ are accumulated according to:

$$D_i = \bigcup_{j=1}^{i} d_j, \qquad i = 1, 2, \ldots, n \qquad (1)$$

where $d_j$ is the dataset arriving at time-point $t_j$, $n$ is the number of time-points, and $t_i$ is the current time-point. Implementation of the landmark window model generates $n$ window panes, where each pane contains the data items evolving from the starting time-point $t_1$ to the current time-point $t_i$.
The sliding window model, on the other hand, is based on a window of fixed size $w$ that contains only those objects falling in the interval $[t_i - w + 1, t_i]$, while older cases are discarded. In this type of model, as time progresses, the window slides forward while keeping its size $w$ by including new data records and discarding the older ones [27,28]. The scenario of the sliding window model can be described in the equation below: . . .
where $m$ is the number of window panes and is equal to $n - w + 2$, $n$ is the number of time-points, and $w$ is the sliding window size.
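The accumulation step can be illustrated with a minimal base-R sketch. The helper names `landmark_panes()` and `sliding_panes()` are hypothetical and are not functions of the package. The sliding helper simply keeps the w most recent datasets for each full window, following the interval given above, so it does not reproduce the package's own convention for the number of panes.

```r
# Hypothetical helper: landmark panes D_i = d_1 U ... U d_i
landmark_panes <- function(listdata) {
  Reduce(rbind, listdata, accumulate = TRUE)
}

# Hypothetical helper: pane for time-point i holds d_(i-w+1), ..., d_i
# (full windows only; an assumption, not the package's internal rule)
sliding_panes <- function(listdata, w) {
  n <- length(listdata)
  stopifnot(w >= 1, w <= n)
  lapply(seq(w, n), function(i) do.call(rbind, listdata[(i - w + 1):i]))
}

# Toy stream: four datasets with 30, 20, 25 and 15 two-dimensional records
set.seed(1)
listdata <- lapply(c(30, 20, 25, 15),
                   function(m) matrix(rnorm(2 * m), ncol = 2))

sapply(landmark_panes(listdata), nrow)        # 30 50 75 90
sapply(sliding_panes(listdata, w = 3), nrow)  # 75 60
```

The landmark panes grow monotonically, whereas each sliding pane forgets the oldest dataset as soon as a new one arrives.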

The change detection algorithm
In order to monitor and trace the evolution of clusters extracted by re-clustering cumulative datasets, a framework known as the MONIC algorithm was introduced in [29]. This algorithm is based on clustering cumulative datasets arriving at discrete time-points $t_1, t_2, \ldots, t_n$. Initially, the data is collected at time-point $t_1$, and as time progresses new data records join the dataset at regular intervals. The initial datasets $d_1, d_2, \ldots, d_n$ are accumulated and re-clustered at each time-point $t_1, t_2, \ldots, t_n$ to monitor and detect the cluster evolution over time.
The algorithm is mainly based on the idea of a non-symmetric overlap matrix between two clusterings extracted from cumulative datasets at two different time-points. Let $\xi_i = \{X_1, X_2, \ldots, X_{k_1}\}$ be the set of clusters extracted from dataset $D_i$ at time-point $t_i$, referred to as the first clustering. Similarly, let $\xi_j = \{Y_1, Y_2, \ldots, Y_{k_2}\}$ be the set of clusters extracted from dataset $D_j$ at time-point $t_j$ ($i < j$), referred to as the second clustering. Then the elements of the overlap matrix can be defined as:

$$\text{overlap}(X_l, Y_m) = \frac{|X_l \cap Y_m|}{|X_l|}, \qquad l = 1, \ldots, k_1, \quad m = 1, \ldots, k_2 \qquad (6)$$

where $k_1$ is the number of clusters in the first clustering $\xi_i$, and $k_2$ is the number of clusters in the second clustering $\xi_j$. This generates a matrix of order $k_1 \times k_2$, where rows and columns describe the first and second clustering respectively. The value of the corresponding element of the matrix represents the similarity index between clusters $X_l$ and $Y_m$. The MONIC framework assumes hard clustering, where each observation belongs to one and only one cluster [30].
In the context of this algorithm, a transition is the change experienced by a cluster $X_l \in \xi_i$ when it is observed again at the second clustering $\xi_j$. This change in the clustering solution is referred to as an external or internal transition. External transitions concern the relationship of a cluster found in clustering $\xi_i$ to the clusters found in clustering $\xi_j$, whereas internal transitions are changes in the structure of the survived clusters.
External transitions fall into five categories: survive, merge, split, disappear, and emerge. A cluster $X_l \in \xi_i$ may survive into $Y_m \in \xi_j$, clusters $\{X_{l_1}, X_{l_2}\} \subseteq \xi_i$ may merge to form $Y_m \in \xi_j$, or a cluster $X_l \in \xi_i$ may split into several daughter clusters $\{Y_{m_1}, Y_{m_2}\} \subseteq \xi_j$. If a cluster $X_l \in \xi_i$ does not experience any of the above transitions, then it disappears. Similarly, if a cluster $Y_m \in \xi_j$ is not the result of any external transition from its ancestors, then it is a newly emerged candidate. The overlap between $X_l \in \xi_i$ and $Y_m \in \xi_j$ serves as the indicator for identifying the external transition experienced by the clusters of clustering $\xi_i$. This value is compared with a minimum threshold $\tau \in [0.5, 1]$ to identify the match of $X \in \xi_i$ in $\xi_j$. A cluster $X_l \in \xi_i$ is said to survive in $Y_m \in \xi_j$ if it is the only cluster whose overlap with $Y_m$ is greater than $\tau_{survive}$. If at least two clusters from $\xi_i$ (such as $X_{l_1}$ and $X_{l_2}$) have an overlap greater than $\tau_{survive}$ with $Y_m \in \xi_j$, then it is a case of merge, i.e. $X_{l_1}$ and $X_{l_2}$ merge to form $Y_m$. Furthermore, a cluster $X_l$ is said to split into daughter clusters if the overlap of $X_l$ with each of $Y_{m_1}$ and $Y_{m_2}$ is greater than $\tau_{split}$ and collectively their overlap is greater than $\tau_{survive}$, i.e. for a split the following two conditions are required.
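As an illustration of the overlap matrix, the following base-R sketch computes the share of each first-clustering cluster that reappears in each second-clustering cluster. The function `overlap_matrix()` and its inputs (named membership vectors, where the names identify the objects) are assumptions made for this example and are not the package's `Overlap` method.

```r
# Hypothetical helper: overlap between two hard clusterings.
# mem_x, mem_y: named integer vectors of cluster labels; names are object ids.
overlap_matrix <- function(mem_x, mem_y) {
  lev_x <- sort(unique(mem_x))
  lev_y <- sort(unique(mem_y))
  common <- intersect(names(mem_x), names(mem_y))
  # counts[l, m] = number of objects shared by X_l and Y_m
  counts <- unclass(table(factor(mem_x[common], levels = lev_x),
                          factor(mem_y[common], levels = lev_y)))
  size_x <- as.vector(table(factor(mem_x, levels = lev_x)))
  ov <- sweep(counts, 1, size_x, "/")   # divide row l by |X_l|
  dimnames(ov) <- list(paste0("X", lev_x), paste0("Y", lev_y))
  ov
}

mem_x <- c(a = 1, b = 1, c = 2, d = 2)          # first clustering
mem_y <- c(a = 1, b = 1, c = 1, d = 2, e = 2)   # second clustering, one new object
overlap_matrix(mem_x, mem_y)
#     Y1  Y2
# X1 1.0 0.0
# X2 0.5 0.5
```

Because each row is normalized by the size of the first-clustering cluster, the matrix is not symmetric: swapping the two clusterings generally gives different values.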
$$\text{overlap}(X_l, Y_m) > \tau_{split}, \qquad m = 1, 2, \ldots, M \qquad (7)$$

$$\text{overlap}\Big(X_l, \bigcup_{m=1}^{M} Y_m\Big) > \tau_{survive} \qquad (8)$$

where $M$ is the number of daughter clusters from the second clustering. The overlap cannot be used as an indicator for monitoring changes in the form of survived clusters. The shift in the location of a survived cluster ($X_l \to Y_m$) can be traced by calculating the Euclidean distance between the centroids, normalized by the minimum radius. This information can be summarized in the following formula:

$$\text{location.difference} = \frac{d(\bar{X}_l, \bar{Y}_m)}{\min(r_{X_l}, r_{Y_m})} \qquad (9)$$

where $\bar{X}_l$ and $\bar{Y}_m$ are the centroids of clusters $X_l$ and $Y_m$ respectively, and $d(\bar{X}_l, \bar{Y}_m)$ is the Euclidean distance between them. Here $r$ denotes the radius of the corresponding cluster and is computed as the maximum distance of an object from its cluster centroid. If the absolute value of location.difference is greater than $\tau_{location}$, then the algorithm detects a shift in the location of the survived cluster.
For the density transition, the average distance of objects from the cluster centroid is computed. The density of a cluster $X$ with centroid $\bar{X}$ is given by:

$$\text{density}(X) = \frac{1}{|X|} \sum_{x \in X} d(x, \bar{X}) \qquad (10)$$

The difference in density of cluster $X_l$ survived in $Y_m$ is normalized by the minimum radius, i.e.

$$\text{density.difference} = \frac{\text{density}(X_l) - \text{density}(Y_m)}{\min(r_{X_l}, r_{Y_m})} \qquad (11)$$

If the absolute value of density.difference is less than $\tau_{density}$, then there is no change in the density of the survived cluster. On the other hand, if the absolute value is greater than $\tau_{density}$, then a change in density is detected. If density.difference is positive, then the cluster is more compact than its ancestor; otherwise, it has become more diffuse.
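The external transition rules described above (survive, merge, split, disappear, emerge) can be sketched in a few lines of R. The function `classify_external()`, its default thresholds, and the toy overlap matrix are assumptions chosen for illustration. The sketch is a simplified reading of the rules and not the package's implementation.

```r
# Hypothetical helper: classify external transitions from an overlap matrix
# (rows = first clustering, columns = second clustering).
classify_external <- function(ov, tau_survive = 0.7, tau_split = 0.3) {
  k1 <- nrow(ov); k2 <- ncol(ov)
  match_y    <- rep(NA_integer_, k1)     # matched cluster in the second clustering
  status     <- rep("disappear", k1)
  split_into <- vector("list", k1)
  for (l in seq_len(k1)) {
    best <- which.max(ov[l, ])
    if (ov[l, best] > tau_survive) {
      match_y[l] <- best
      status[l]  <- "survive"
    } else {
      daughters <- which(ov[l, ] > tau_split)
      # split: each daughter exceeds tau_split, together they exceed tau_survive
      if (length(daughters) >= 2 && sum(ov[l, daughters]) > tau_survive) {
        status[l] <- "split"
        split_into[[l]] <- daughters
      }
    }
  }
  # two or more first-clustering clusters matched to the same Y_m: merge
  merged_y <- as.integer(names(which(table(match_y) >= 2)))
  status[match_y %in% merged_y] <- "merge"
  used_y <- unique(c(match_y[!is.na(match_y)], unlist(split_into)))
  list(status = status, match = match_y, split_into = split_into,
       emerged = setdiff(seq_len(k2), used_y))
}

ov <- rbind(c(0.90, 0.05, 0.00, 0.00),
            c(0.10, 0.45, 0.40, 0.00),
            c(0.05, 0.10, 0.10, 0.20))
res <- classify_external(ov)
res$status    # "survive" "split" "disappear"
res$emerged   # 4
```

In this toy matrix the first cluster survives, the second splits into the second and third clusters of the later clustering, the third disappears, and the fourth column has no ancestor and is therefore reported as newly emerged.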
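The internal transition checks can be sketched as follows. The helpers `cluster_summary()` and `internal_transition()` and the threshold defaults are hypothetical. They follow the verbal definitions above (centroid distance and density difference, both normalized by the smaller radius) and are not taken from the package source.

```r
# Hypothetical helper: centroid, radius (max distance) and density (mean distance)
cluster_summary <- function(x) {
  x <- as.matrix(x)
  centre <- colMeans(x)
  d <- sqrt(rowSums(sweep(x, 2, centre)^2))
  list(centre = centre, radius = max(d), density = mean(d))
}

# Hypothetical helper: internal transition of a survived cluster X_l -> Y_m
internal_transition <- function(x, y, tau_location = 0.3, tau_density = 0.3) {
  sx <- cluster_summary(x)
  sy <- cluster_summary(y)
  r_min <- min(sx$radius, sy$radius)
  loc_diff <- sqrt(sum((sx$centre - sy$centre)^2)) / r_min
  den_diff <- (sx$density - sy$density) / r_min
  change <- if (abs(den_diff) <= tau_density) {
    "no change"
  } else if (den_diff > 0) {
    "more compact"
  } else {
    "more diffuse"
  }
  list(location.difference = loc_diff, density.difference = den_diff,
       shift = abs(loc_diff) > tau_location, density = change)
}

x <- rbind(c(0, 0), c(2, 0), c(0, 2), c(2, 2))   # cluster at first clustering
y <- sweep(x, 2, c(1, 0), "+")                   # same shape, moved by one unit
internal_transition(x, y)$shift                  # TRUE
internal_transition(x, 0.5 * x + 0.5)$density    # "more compact"
```

The first call moves the cluster without changing its spread, so only a location shift is flagged. The second call shrinks the cluster around the same centroid, so only the density changes.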

Package description
The state-of-the-art MONIC algorithm is implemented in R via the package clusTransition. The package can be used for tracing and monitoring the evolution of clustering solutions in cumulative datasets over time. In this section, we describe the functions and methods exported by the package. Fig 1 demonstrates the workflow of the package. Table 1 summarizes the functions, methods, and classes exported by the package along with their corresponding arguments and slots.
More details about these functions and classes are described below.

Function Transition()
The evolution of clusters can be traced using the primary function Transition(), which returns an S4 object. In implementing the package clusTransition, we have considered the portability of the functions across various types of hard clustering algorithms. A typical call to the Transition() function involves three essential pieces: the data input (listdata, listclus, or Overlap), the choice of window via swSize, and the threshold parameters. The swSize and k arguments need to be provided only when the datasets are imported using the listdata argument. This function has the following interface: > Transition(listdata, listclus = NULL, Overlap = NULL, swSize = 1, typeind = 1, . . . Three different options, i.e. listdata, listclus, and Overlap, are provided for importing the data.
The listdata argument imports the raw data stream at discrete time points $t_1, t_2, \ldots, t_n$. A sequence of cluster solutions is generated from the stream using the k-means clustering

PLOS ONE
algorithm. Each element of the list corresponds to the dataset at a single time point. The number of clusters in each cumulative data matrix is specified by the argument k.
On the other hand, the listclus argument imports the clustering solutions at successive time-points to allow clustering algorithms other than k-means. Each element of listclus is a nested list that contains the clustering solution at the corresponding time point i.e.
Overlap is a list of numeric matrices containing similarity measures between clusters extracted at consecutive time points. The similarities between clusters are computed using Eq 6. The Overlap method exported by the package can be used to compute the similarity matrices.
swSize indicates the size of the sliding window model. The default value swSize = 1 implements the landmark window model and discretizes the stream according to Eq 1, whereas other numeric values discretize the stream using a sliding window scenario according to Eq 5. The sliding window size can only be provided if the listdata argument is chosen.
The survival_thresHold, split_thresHold, location_thresHold, and density_thresHold arguments are the minimum threshold values for the survival of clusters from $X \in \xi_i$ to $Y \in \xi_j$, the split of a cluster $X \in \xi_i$ into $\{Y_{m_1}, Y_{m_2}\} \subseteq \xi_j$, a shift in location, and a change in density of survived clusters, respectively. These are user-defined parameters and belong to the interval (0,1).
One of the most perplexing problems with most clustering algorithms is deciding the ideal number of partitions. This is a crucial parameter for partitioning, hierarchical and model-based clustering algorithms. The number of clusters one wants to generate from a dataset has to be predefined. There are several ways of estimating the optimal number of clusters k, such as the silhouette, Gap, and Elbow methods. k is a numeric vector containing the relevant number of clusters at the corresponding time-point. The length of k is to be determined from the swSize. This argument should only be provided if listdata argument is chosen.
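As a small illustration of how a value of k might be chosen for one window pane, the sketch below applies the Elbow method with base R's kmeans(). The simulated pane and the range of candidate values are assumptions made for the example. In practice the silhouette or Gap statistic could be used in the same loop.

```r
# Simulated window pane with three well-separated groups (illustrative only)
set.seed(42)
centres <- rbind(c(0, 0), c(6, 0), c(3, 6))
D <- do.call(rbind, lapply(1:3, function(j)
  cbind(rnorm(50, centres[j, 1]), rnorm(50, centres[j, 2]))))

# Total within-cluster sum of squares for k = 1, ..., 6
wss <- sapply(1:6, function(k)
  kmeans(D, centers = k, nstart = 20)$tot.withinss)
round(wss, 1)

# The Elbow method picks the k after which the curve flattens
plot(1:6, wss, type = "b", xlab = "k", ylab = "Within-cluster sum of squares")
```

Repeating this for every window pane gives one value per pane, which together form the vector supplied through the k argument.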
Printing the object returned by the Transition() function displays the external and internal transition results at each time point. External transitions include the number of clusters that still exist, were absorbed by others, split into several, disappeared, or newly emerged at the second clustering. Internal transitions comprise changes in the location and density of the survived clusters.
Along with this information, the Monic object holds the clusters' radii, memberships, and the distances between cluster centres.

OverLap class
OverLap is an S4 class whose objects contain summaries of the first and second clustering. An object of this class has eight slots that serve as input for tracking the evolution of clusters by the Transition() function. The slots include a numeric matrix containing the similarities between clusters generated at the first and second clustering (the overlap computed from Eq 6), the clusters' membership vectors, radii, centres, and the average distance of items from the clusters' centres (computed from Eq 10). An empty object is created through the following interface: > obj <- new("OverLap")

Overlap method
This method initializes the slots of an object of class OverLap by importing the clustering solutions $\xi$ of the cumulative datasets $D$ at two consecutive time points $i$ and $j$. The clusters at each time point should be provided as a list of matrices, where each matrix contains the data belonging to one cluster. It has the following interface: > Overlap <- Overlap(object, e1 = C1, e2 = C2) where e1 is the set of clusters $\xi_i = \{X_1, X_2, \ldots, X_{k_1}\}$ obtained at time point $t_i$ from cumulative dataset $D_i$, e2 is the set of clusters $\xi_j = \{Y_1, Y_2, \ldots, Y_{k_2}\}$ obtained at time point $t_j$ from cumulative dataset $D_j$, and object is an object of class OverLap.

Function moplot()
This method draws three bar plots and one line graph. The first, a stacked bar plot, shows SurvivalRatio and AbsorptionRatio; the second bar plot shows the number of newly emerged clusters at each time stamp; and the third bar plot shows the number of disappeared clusters at each time stamp. The line graph shows passforwardRatio and SurvivalRatio.

Simulation example
Let us assume that a data stream consists of datasets $d_1, d_2, \ldots, d_n$ arriving at corresponding time-points $t_1, t_2, \ldots, t_n$. To generate the initial dataset $d_1$, we use a generator that takes into account the number of clusters (k), the size of each cluster, and the separation value between them [31]. The generator for the subsequent datasets $d_2, d_3, \ldots, d_n$ takes the center of each cluster, the size of each cluster, and the covariance structure between them as input [32,33]. As a working example, we generate a data stream arriving at four consecutive time points.

Pre-processing
Before applying the change detection algorithm to cluster solutions over time, the user needs to specify some relevant parameters. First, the user needs to decide on a suitable windowing approach for accumulating the datasets that evolve at successive time points. For this purpose, the package offers two types of windowing approaches, i.e. the landmark and sliding window models. The windowing approach accumulates the datasets at the corresponding time points according to the chosen model and generates window panes at successive time points. In the second phase, the optimal number of clusters in each window pane $D_i$ at the corresponding time point must be determined using an appropriate technique. For illustration purposes, we use worked examples based on the datasets simulated in the Simulation example section. The datasets are accumulated according to the landmark and sliding windowing approaches, and then the optimal number of clusters is estimated in each window pane $D_i$.
The implementation of the landmark window model produces four window panes. Each pane contains the datasets generated in the interval $[t_1, t_i]$, where $t_i$ represents the current time point. Table 2 shows the number of objects and the optimal number of clusters in each window pane $D_i$, estimated from the Gap statistic at the corresponding time points $t_i$.
Similarly, the implementation of a sliding window of size 3 generates 3 window panes. Table 3 shows the number of objects and the optimal number of clusters in each window pane $D_i$.

Implementation of function Transition()
In this section, the use of the primary function Transition() is presented through working examples. The data stream simulated in the Simulation example section is used for monitoring the cluster evolution over time. The function provides three different options for importing the datasets, which are explained in the subsections below.

Looking at listdata argument
The argument listdata is a list of matrices or data frames containing the datasets $d_1, d_2, \ldots, d_n$ evolving at corresponding time-points $t_1, t_2, \ldots, t_n$. The $i$th element of listdata comprises the set of data items $d_i$ that evolve at the corresponding time point $t_i$. The Transition() function accumulates the datasets $d_i$ according to the windowing approach specified in the swSize argument. The default value, swSize = 1, implements the landmark window model, whereas other integer values implement the sliding window model. The accumulation of datasets $d_i$ generates window panes $D_i$ that contain cumulative datasets at successive time points. Each window pane $D_i$ is re-clustered using the cclust() function from the flexclust package [34]. The optimal number of clusters in the cumulative datasets $D_i$ should be decided by the user and must be supplied via the argument k of the function. Both the k and swSize arguments are used only if the listdata option is chosen for importing the datasets $d_i$. The argument typeind = 1 selects the listdata option. Monitoring and tracking the evolution of clusters using the landmark window model is shown in the example below.

Example (listdata argument with landmark window model).
The default value swSize = 1 implements the landmark window model and generates n window panes of cumulative datasets $D_i$ according to Eq 1. In this working example, the datasets generated in the Simulation example section are used. According to Table 2 . . .
This will generate two tables displaying the number of clusters experiencing external and internal transitions at successive time points. The first table in the output comprises the number of clusters that experience external transitions at the corresponding time points $t_j$. Similarly, the second table comprises the number of survived clusters that underwent internal transitions at the corresponding time points. Hence the full summary of external and internal transitions is shown below.
The object clusterTrace returned by the Transition() function is an S4 object of class Monic. The object contains the candidates that experience external and internal transitions at successive time points. The slots ending with x represent candidates that undergo external transitions from the first clustering $\xi_i$, whereas the slots ending with y represent the candidates that evolve as a result of the corresponding external transition at the second clustering $\xi_j$. For example, the candidates that experience external transitions at time point $t_3$ can be retrieved as:
Available components:
=====================
"SurvivalCanx" "SurvivalCany" "SplitCanx" "SplitCany" "MergeCanx" "MergeCany" "EmergCan" "ShiftLocCan" "NoShiftLocCan" "MoreCompactCan" "MoreDiffuseCan" "NoChangeCompactCan" "Centersx" "Centersy" "clusterMem" "avgDisx" "avgDisy" "rx" "ry" "SurvivalRatio" "AbsorptionRatio" "passforwardRatio"
Let $C_{im} \in \xi_i$ (first clustering) be a cluster that experiences some external transition and evolves as $C_{jn} \in \xi_j$ (second clustering), where the first subscript ($i$ and $j$) represents the time point and the second subscript ($m$ and $n$) represents the cluster number. The Time Step [[3]] label in the output represents the time point $t_j$ of the second clustering, and hence the time point $t_i$ ($i = j - 1$) of the first clustering $\xi_i$ is one less. So in this particular example $i = 2$ and $j = 3$, and the above transition can be summarized as follows: the algorithm detects that three clusters survive ($C_{21} \to C_{31}$, $C_{23} \to C_{34}$, and $C_{24} \to C_{32}$) and one cluster splits ($C_{22} \to \{C_{33}, C_{35}\}$).
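A call of the kind described in this example might look like the sketch below. The simulated stream is purely illustrative and is not the dataset used in the paper. The call uses only the argument names mentioned in the text (listdata, swSize, typeind, k), and the choice of three clusters per pane is an assumption. The call is guarded so that the data preparation also runs where the package is not installed.

```r
# Illustrative stream: four datasets d_1, ..., d_4, each with three groups
set.seed(123)
centres <- rbind(c(0, 0), c(8, 0), c(4, 8))
simulate_batch <- function(m) {
  do.call(rbind, lapply(1:3, function(j)
    cbind(x1 = rnorm(m, centres[j, 1]), x2 = rnorm(m, centres[j, 2]))))
}
listdata <- lapply(rep(40, 4), simulate_batch)

# Landmark window (swSize = 1) with an assumed k = 3 in every window pane
if (requireNamespace("clusTransition", quietly = TRUE)) {
  clusterTrace <- clusTransition::Transition(listdata = listdata, swSize = 1,
                                             typeind = 1, k = rep(3, 4))
  print(clusterTrace)
}
```

The vector passed to k has one entry per window pane, so with four time points and the landmark model it has length four.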

Looking at listclus argument
The listdata argument permits users to supply un-clustered datasets $d_1, d_2, \ldots, d_n$ arriving at time-points $t_1, t_2, \ldots, t_n$. However, this restricts the package to only one type of clustering algorithm, i.e. the k-means algorithm. In order to make the package more flexible for other types of hard clustering, an alternative argument, listclus, is provided in the function. The listclus argument imports the clustering solutions of each window pane as a list, i.e. listclus = $\{\xi_1, \xi_2, \ldots, \xi_n\}$, and computes the similarity indices between them. The argument listclus is a list in which every individual element is a nested list of matrices or data frames. The $i$th element corresponds to the set of clusters $\xi_i = \{X_1, X_2, \ldots, X_{k_i}\}$ extracted at time-point $t_i$ by applying an appropriate clustering algorithm to window pane $D_i$. This is explained in the example given below.

Example: Listclus argument.
Prior to applying the Transition() function, the user needs to extract clusters from each window pane $D_i$. For this purpose, first accumulate the initially collected datasets $d_1, d_2, \ldots, d_n$ according to a suitable window model, which is the landmark model in this example. This can be done by explicitly calling the merge() function from the base package. Running the R code given below will generate 4 panes.
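The preparation of a listclus-style input can be sketched as follows. The simulated data are illustrative, rbind() is used here for row-wise accumulation instead of merge(), and average-linkage hierarchical clustering with three clusters is an arbitrary stand-in for any hard clustering algorithm. The helper `cluster_pane()` is hypothetical.

```r
# Illustrative stream of four datasets with three groups each
set.seed(7)
centres <- rbind(c(0, 0), c(8, 0), c(4, 8))
listdata <- lapply(1:4, function(i)
  do.call(rbind, lapply(1:3, function(j)
    cbind(rnorm(30, centres[j, 1]), rnorm(30, centres[j, 2])))))

# Landmark accumulation: pane i stacks d_1, ..., d_i row-wise
panes <- Reduce(rbind, listdata, accumulate = TRUE)

# Hypothetical helper: cluster one pane and return one matrix per cluster
cluster_pane <- function(D, k) {
  labels <- cutree(hclust(dist(D), method = "average"), k = k)
  lapply(split(seq_len(nrow(D)), labels),
         function(idx) D[idx, , drop = FALSE])
}

# Nested list: one element per time point, one matrix per cluster
listclus <- lapply(panes, cluster_pane, k = 3)
sapply(listclus, function(cl) sum(sapply(cl, nrow)))   # 90 180 270 360
```

Each element of the resulting list is itself a list of matrices, which matches the nested structure described for the listclus argument.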

Looking at Overlap argument
The Overlap argument also permits the user to apply other types of clustering algorithms and trace the evolution of clusters over time. The Overlap argument imports a list of objects, as produced by the Overlap() method, that contain the similarities between the clusterings obtained at successive time points $t_i$ and $t_j$ ($i < j$) together with the summaries of these clusters. This can be implemented by setting typeind = 2. The overlap matrices can be computed by using the S4 method Overlap() exported by the clusTransition package. In the same way as for listclus, a clustering algorithm can be applied to the landmark or sliding window modeled dataset to extract the cluster memberships at the corresponding time-points. The lists of clusters extracted from $D_i$ and $D_{i-1}$ can be used to compute the overlap matrix between the clusterings. This is elaborated in the working example given below.
Consequently, no disappearing cluster and no newly emerged candidate were detected at any of the time points. This can be seen from the pass-forward ratio, which is unity at all time points except $t_2$, where one cluster splits into daughter candidates.

Real data example
To demonstrate the practicality of the package and to better understand applications of cluster evolution, we investigate three real-life datasets. To study the transformation in social, political, and moral attitudes of European nations, the Human Values datasets were extracted from the European Social Survey [35]. The changes in electricity consumption were traced using the Individual Household Electric Power Consumption dataset. Similarly, the Intel Lab sensors streaming dataset was used to show the applications of the framework. Both these data streams were extracted from the home page of the "UCI Machine Learning Repository".

Application to human values scale
As a case study, we extract eight datasets, each corresponding to a single round of the European Social Survey (ESS) conducted in the years 2002, 2004, 2006, 2008, 2010, 2012, 2014, and 2016 respectively. The dataset consists of 25024 individuals who responded to the Schwartz Value Survey (SVS) for computing basic human values and can be downloaded from the URL https://ess-search.nsd.no/CDW/ConceptVariables. The ten basic values are Benevolence, Universalism, Self-direction, Security, Conformity, Hedonism, Achievement, Tradition, Stimulation, and Power [35]. The k-means clustering algorithm was applied to the sliding window-modeled datasets at each time point, and the number of clusters in the respective datasets was estimated from the well-known Gap statistic. Fig 4 describes the evolution of clusters at time points $t_i$, $i = 1, 2, \ldots, 7$, in the Human Value scale datasets and demonstrates that two clusters, $C_{11}$ and $C_{12}$, survived over time. The first notable cluster was $C_{11}$ ($C_{11} \to C_{22} \to C_{32} \to C_{42}$), which emerged at $t_1$ (2002) and survived until $t_4$ (2006, 2010). Although the cluster survived until 2010, it experienced internal transitions, became more diffuse, and eventually disappeared at time-point $t_5$. The second notable cluster was $C_{12}$ ($C_{12} \to C_{24} \to C_{33} \to C_{41} \to C_{52} \to C_{63} \to C_{71}$), which survived through the entire time span. This was the most important cluster because it not only survived over time but also became denser. Most new respondents of the SVS surveys over the years joined this cluster. A shift in location was observed for this cluster at time-points $t_2$ and $t_3$, after which it remained stable. The first external transition was experienced by cluster $C_{14}$, which split into two clusters and ultimately disappeared. The algorithm also detects a cluster $C_{61}$ that emerged at $t_6$ (2010, 2014) and passed forward while absorbing elements of cluster $C_{62}$.

Application to Individual Household Electric Power Consumption
As a second example, the Individual Household Electric Power Consumption dataset for the years [2006,2010] was used. This dataset comprises 2075259 measurements of a single household's electricity consumption, characterized by seven numerical attributes. The dataset is available at the machine learning repository [36] and can be downloaded from https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. A sliding window model of size 2 was used for accumulating the stream at successive time points. In this section, we use the CLARA algorithm to extract clusters from the datasets at successive time points, and the average silhouette method was used to estimate the optimal value of k in each window pane. Fig 5 demonstrates the evolution of clusters at time points $t_i$, $i = 1, 2, 3, 4, 5$, in the individual household electric power consumption datasets. The algorithm detects that all four clusters survive ($C_{11} \to C_{21}$, $C_{12} \to C_{21}$, $C_{13} \to C_{23}$, and $C_{14} \to C_{24}$), experiencing internal transitions and becoming more diffuse during [2006,2007]. A shift in location was detected for only one cluster, $C_{13}$, whereas the other clusters remained stable in location. Similarly, three clusters survive ($C_{21} \to C_{31}$, $C_{22} \to C_{33}$, and $C_{24} \to C_{34}$), one cluster disappears ($C_{23} \to \emptyset$), and one cluster emerges ($\emptyset \to C_{32}$) during [2007,2008]. Two of the survived clusters became more diffuse, while one cluster became more compact than its predecessor. Likewise, one cluster survives ($C_{33} \to C_{43}$), three disappear ($C_{31} \to \emptyset$, $C_{32} \to \emptyset$, and $C_{34} \to \emptyset$), and three newly emerged clusters ($\emptyset \to C_{41}$, $\emptyset \to C_{42}$, and $\emptyset \to C_{44}$) were detected during [2008,2009]. Afterwards, all four clusters disappear ($C_{41} \to \emptyset$, $C_{42} \to \emptyset$, $C_{43} \to \emptyset$, and $C_{44} \to \emptyset$), and three new clusters emerge ($\emptyset \to C_{51}$, $\emptyset \to C_{52}$, and $\emptyset \to C_{53}$) during [2009,2010].

Intel Lab dataset
In this section, we use the publicly accessible dataset recorded from 54 sensors deployed at the Intel research laboratory between February 28th and April 5th, 2004. Each sensor records information on temperature, humidity, voltage, and light every thirty-one seconds. The dataset comprises 2.3 million readings collected from the 54 sensors. The sensors were designed to be energy-efficient, consuming power only when sensing the environment and transmitting data. We select only a subset of measurements from this dataset and include readings from sensor-1 only. This subset of the data consists of 43,047 readings from sensor-1 and can be downloaded from the URL https://www.kaggle.com/datasets/divyansh22/intel-berkeleyresearch-lab-sensor-data. We accumulate the dataset according to the landmark window model and, as the flow is uniform, we consider 9000 records per time period. This implementation generates 5 window panes of cumulative datasets. The shadow statistic was used to decide the optimal number of clusters in the cumulative datasets at the corresponding time points. The Partitioning Around Medoids (PAM) algorithm was used for extracting clusters from the datasets. Fig 6 demonstrates the transitions of clusters at time points $t_i$, $i = 1, 2, 3, 4, 5$, in the Intel Lab dataset. The algorithm detects that all six clusters survive ($C_{11} \to C_{21}$, $C_{12} \to C_{22}$, $C_{13} \to C_{24}$, $C_{14} \to C_{25}$, $C_{15} \to C_{26}$, and $C_{16} \to C_{23}$) while one new cluster emerges ($\emptyset \to C_{27}$) at time point $t_2$. All survived clusters experience internal transitions and become more diffuse. Also, six clusters survive ($C_{21} \to C_{31}$, $C_{22} \to C_{32}$, $C_{24} \to C_{33}$, $C_{25} \to C_{34}$, $C_{26} \to C_{35}$, and $C_{27} \to C_{36}$) and one cluster disappears ($C_{23} \to \emptyset$) at time point $t_3$. Cluster $C_{24}$ experiences a double internal transition, i.e. a shift in location and a change in density, while the other clusters only become more diffuse. Likewise, five clusters survive ($C_{31} \to C_{43}$, $C_{32} \to C_{45}$, $C_{34} \to C_{44}$, $C_{35} \to C_{42}$, and $C_{36} \to C_{47}$), one cluster disappears ($C_{33} \to \emptyset$), and two clusters emerge ($\emptyset \to C_{41}$ and $\emptyset \to C_{46}$) at time point $t_4$. Similarly, five clusters survive ($C_{42} \to C_{54}$, $C_{43} \to C_{56}$, $C_{44} \to C_{57}$, $C_{45} \to C_{53}$, and $C_{47} \to C_{55}$), two clusters merge ($\{C_{41}, C_{46}\} \to C_{51}$), and one cluster emerges ($\emptyset \to C_{52}$) at time point $t_5$.
For further details on the significance and practical applications of monitoring changes in clustering solutions of streaming datasets, see Atif et al. [37].

Concluding remarks
In this paper, we introduce the R package clusTransition, dedicated to tracing the evolution of cluster solutions in cumulative datasets. The package implements the state-of-the-art MONIC algorithm for modeling and tracing the transition of cluster solutions in dynamic datasets. This algorithm is based on re-clustering the cumulative datasets $D_1, D_2, \ldots, D_n$ arriving at corresponding time-points $t_1, t_2, \ldots, t_n$ and monitoring the changes occurring in these cluster solutions. The changes comprise clusters that still exist, split into several, are absorbed by others, disappear, or newly emerge. The clusters that survive an external transition may experience a change in location and density, called an internal transition. We have applied the clusTransition package to synthetic as well as real-life datasets to gain insight into the change detection framework.

Limitations of the package
The clusTransition package relies on batch processing, where the stream is discretized and the gathered data is fed into the windowing model. The datasets are not clustered immediately upon arrival in real time. Similarly, the sliding and landmark models either retain data items completely or ignore them entirely at subsequent time-points. A damped window model, on the other hand, assigns each object an exponentially decreasing weight depending on its arrival time. Future plans call for adding support for the damped window model to the R package for change detection.
The paradigm for cluster transition monitoring presupposes hard clustering, which requires that each item be assigned to one and only one cluster. This assumption implies that the strategy cannot be applied to density-based or model-based clustering approaches, leaving the problem open for further investigation.